Abstract

Background: Gene clustering for annotating gene functions is one of the fundamental issues in bioinformatics.
The best clustering solution is often regularized by multiple constraints such as gene expression, Gene Ontology
(GO) annotations and gene network structures. How to integrate multiple constraints into an optimal
clustering solution remains an unsolved problem.

Results: We propose a novel multiconstrained gene clustering (MGC) method within the generalized projection
onto convex sets (POCS) framework used widely in image reconstruction. Each constraint is formulated as a
corresponding set. The generalized projector iteratively projects the clustering solution onto these sets in order to
find a consistent solution in the intersection set that satisfies all constraints. Compared with previous MGC
methods, POCS can integrate multiple constraints of different natures without distorting the original constraints.
To evaluate the clustering solution, we also propose a new performance measure referred to as Gene Log
Likelihood (GLL) that accounts for genes having more than one function and hence belonging to more than one cluster.
Comparative experimental results show that our POCS-based gene clustering method outperforms current
state-of-the-art MGC methods.

Conclusions: The POCS-based MGC method can successfully combine multiple constraints of different natures
for gene clustering. Also, the proposed GLL is an effective performance measure for soft clustering solutions.



Background 

Computational annotation of gene functions is a fundamental issue in bioinformatics. Microarray gene
expression data have been used widely to study the cell cycle system, genetic regulatory interactions,
development at the molecular level, and genes that act in response to a certain infectious disease.
To determine gene functions, a basic approach is gene clustering using gene expression data, based on
the assumption that genes with similar expression patterns should share similar functions in the
process. Typical gene clustering methods include hierarchical clustering [1], the k-means algorithm [2],
self-organizing maps [3], the fuzzy c-means algorithm [4], and hidden Markov models [5]. However, gene
clustering regularized by the single constraint of gene expression is not enough to obtain biologically
reliable clusters, because microarray data are often noisy, contain missing values, and have uncertain
temporal dependencies in time-series data [6,7]. Therefore, other constraints besides gene expression
data should be incorporated for robust and reliable gene clustering.

Recent multiconstrained gene clustering (MGC) methods
have attracted increasing interest [8-13]. The
basic idea is that multiple constraints such as Gene 
Ontology (GO) and metabolic network structures can 
prevent gene clustering from falling into the locally opti- 
mal solution space constrained by noisy gene expression 
data alone. One key problem is how to combine multiple
constraints to find a consistent clustering
solution. Current MGC methods adopt a linear combi- 
nation strategy to integrate multiple constraints of the 
same nature into a single new constraint, so that stan- 
dard clustering algorithms for single-constrained gene 
clustering problems can be used, e.g., hierarchical clus- 
tering [8], Gaussian mixture models [9], k-medoids [10], 
and iterative conditional modes (ICM) for Markov ran- 
dom fields [12]. More specifically, they build a distance 
matrix of gene expression data as the first constraint, 
and then build another distance matrix based on either 
metabolic pathway [8,12,14] or GO annotations [9,10] as
the second constraint. These two constraints of distance 
matrices are added linearly to form the new distance 
matrix for gene clustering. This linear combination 
strategy has also been used to incorporate different con- 
straints in document clustering [15,16]. Despite good 
clustering performance, there are two major problems 
yet to be solved. The first is that these MGC methods 
can only combine constraints of the same nature, i.e., all 
constraints have to be represented as distance matrices. 
If one constraint is a similarity matrix, we need to transform
it into a distance matrix so that we can add it
to the other distance matrices. Such a transformation may
distort the original constraint and lose information.
Even if we have two distance matrices, the distance
values may be on different scales and cannot be added
directly. The second problem lies in the linear combina- 
tion of the constraint matrices. In most cases, the 
desired combined constraint does not necessarily have a 
simple linear relationship with all other original con- 
straints. In addition, the weights for the linear combina- 
tion often need a reasonable justification in practice. 
Another MGC strategy is the GO-guided fuzzy c-means 
(FCM) algorithm [13], which uses GO annotations to 
initialize and update the cluster probability of each gene. 

To overcome the above problems, we propose a novel
MGC method within the generalized projection framework,
a generalization of the projection onto
convex sets (POCS) technique that has found many
applications in image reconstruction [17] and microarray
missing value imputation [18]. Theoretically, POCS
provides a flexible framework to integrate multiple 
pieces of constraints for an optimal solution. It first 
transforms each constraint into a corresponding convex 
set, and then uses an iteratively convergent procedure to 
find a solution in the intersection of all sets. POCS can
integrate constraints of different natures, such as different
similarity matrices. Indeed, it often handles different
constraints in the frequency and spatial domains in
image reconstruction problems. Another advantage is
that the original constraints remain intact. The cluster- 
ing result is projected onto the solution set that satisfies 
each constraint iteratively and the final result may lie in 
the intersection set that satisfies a nonlinear combina- 
tion of the original constraints. Without loss of general- 
ity, in this paper we consider two major types of 
constraints: the gene expression similarity [8] and the 
GO-based semantic similarity [19]. POCS produces a 
regularized clustering result that may be more reliable 
than those solely dependent on either the gene expres- 
sion similarity or the GO semantic similarity due to the 
fact that expression data are often short and noisy, 
while GO terms may be inaccurate and mis-annotated. 
Because in most cases the solution set is nonconvex, we
adopt generalized projections similar to the POCS
procedure. To minimize the distance between the candidate
solution and the constraint set, we design the
generalized projector based on a method similar to the
relaxation labeling (RL) algorithm [20,21], which has
been used for approximate inference in Markov
random fields [22,23].

Genes usually have multiple functions and can be
assigned to more than one group. Traditional gene
clustering algorithms often use a hard clustering strategy
that assigns each gene to only one group. Recent MGC
methods relax this limitation and allow genes to be
assigned to several groups [9,10,13]. To take this situation
into account, we use a soft clustering strategy in
which genes are assigned to all clusters with different 
probabilities. Based on soft clustering results, we pro- 
pose a new performance measure "gene log likelihood" 
(GLL) to measure the distance between the predicted 
clustering result and the reference clusters. This mea- 
sure has also been widely applied to evaluating word 
clustering performance in topic modeling problems [24]. 
To confirm the effectiveness, we evaluate the POCS- 
based MGC method on the yeast gene expression data- 
set, and compare the clustering results with recent 
MGC methods such as k-medoids [10], ICM [12] and 
FCM [13]. Experimental results demonstrate that the 
POCS-based MGC can enhance the overall clustering 
performance by a large margin. 

This paper is organized as follows. In the next section we propose the POCS-based MGC method and the
RL-based generalized projector, which minimizes the distance between the clustering solution and the
corresponding constrained solution set. To account for genes in multiple clusters, we also propose GLL
for calculating the distance between the predicted soft clustering results and the reference gene
clusters. The results section shows comparative experimental results on different yeast expression
datasets, and shows that the POCS-based MGC algorithm converges within a few projections in practice.
Finally, we draw conclusions and envision future work.

Methods 

Gene clustering is a labeling problem, in which a set of cluster labels is assigned to genes for
annotating gene functions. Given $I$ genes and $K$ clusters, the soft clustering solution is a matrix
$X = (x_{ik})$, $1 \le i \le I$, $1 \le k \le K$, where $x_{ik} \in [0, 1]$ and $\sum_k x_{ik} = 1$.
The element $x_{ik}$ is the probability that the $i$th gene is associated with the $k$th cluster label.
For each gene we use a probability vector $\mathbf{x}_i = (x_{i1}, \ldots, x_{ik}, \ldots, x_{iK})$ to
represent its cluster labeling configuration. From this perspective, the clustering solution $X$ is the
cluster labeling configuration of $I$ genes over $K$ clusters. We may also use the winner-take-all
strategy to obtain the hard clustering solution $X^*$, in which the $i$th gene belongs only to the
cluster with the highest probability, i.e., $k^* = \arg\max_k x_{ik}$ and $x^*_{ik^*} = 1$.
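
The winner-take-all step can be illustrated with a minimal Python sketch. The function name `harden` and the toy matrix are assumptions made for illustration only; ties are broken in favor of the lowest cluster index, which the text does not specify.

```python
def harden(X):
    """Winner-take-all: turn each soft row of X into a one-hot row.

    Ties go to the lowest cluster index (an assumption; the text
    does not say how ties are resolved).
    """
    hard = []
    for row in X:
        k_star = max(range(len(row)), key=lambda k: row[k])
        hard.append([1 if k == k_star else 0 for k in range(len(row))])
    return hard


# Toy soft clustering solution: 2 genes, 3 clusters, rows sum to one.
X = [[0.7, 0.2, 0.1],
     [0.1, 0.3, 0.6]]
X_star = harden(X)
```

Here the first gene is assigned to the first cluster and the second gene to the third cluster.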

Gene expression constraint 

Based on microarray gene expression profiles, we can build the first constraint using the similarity
matrix for gene clustering. The metric can be the Pearson's correlation coefficient or the Euclidean
distance [8-10], or the more complex type-2 fuzzy hidden Markov model-based sequence similarity [25].
Because the Pearson's correlation coefficient is suitable for time-series gene expression data [26], we
adopt it for calculating the similarity between two genes' log-ratio transformed profiles [8], i.e., the
logarithm of the ratio between each sample point in the profile and a control measurement. More
specifically, given two genes' transformed profiles $g_i(m)$ and $g_{i'}(m)$ of length $M$, the
correlation coefficient $v^1_{ii'}$ is

$$v^1_{ii'} = \frac{1}{M} \sum_{m=1}^{M} \left( \frac{g_i(m) - \mu_i}{\sigma_i} \right) \left( \frac{g_{i'}(m) - \mu_{i'}}{\sigma_{i'}} \right),$$

where $\mu_i$ and $\sigma_i$ denote the mean and standard deviation of the transformed profile of the
$i$th gene, respectively. The correlation coefficient value $v^1_{ii'} \in [-1, 1]$, where a higher value
corresponds to a higher similarity between two genes' profiles. Here we consider anti-correlated genes
as the most dissimilar because correlated genes are often involved in similar reaction steps and share
similar functions. Therefore, the Pearson's correlation coefficient matrix $v^1_{ii'}$ constrains the
first clustering solution set $C_1 = \{X_e\}$, which contains many locally optimal clustering solutions
satisfying $v^1_{ii'}$.
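
As an illustration, the correlation-based similarity matrix can be computed as in the following sketch. The function names and the toy profiles are hypothetical, the standard deviation is taken with a $1/M$ normalization so that the result stays in $[-1, 1]$, and profiles are assumed to be non-constant and free of missing values.

```python
import math


def pearson_similarity(g_a, g_b):
    """Pearson correlation of two equally long, non-constant profiles."""
    M = len(g_a)
    mu_a, mu_b = sum(g_a) / M, sum(g_b) / M
    # Population standard deviations (1/M), matching the 1/M prefactor.
    sd_a = math.sqrt(sum((v - mu_a) ** 2 for v in g_a) / M)
    sd_b = math.sqrt(sum((v - mu_b) ** 2 for v in g_b) / M)
    cov = sum((a - mu_a) * (b - mu_b) for a, b in zip(g_a, g_b)) / M
    return cov / (sd_a * sd_b)


def similarity_matrix(profiles):
    """Full I x I matrix of pairwise correlations."""
    return [[pearson_similarity(p, q) for q in profiles] for p in profiles]


# Hypothetical log-ratio profiles for three genes over four time points.
profiles = [[1.0, 2.0, 3.0, 4.0],
            [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with gene 0
            [4.0, 3.0, 2.0, 1.0]]   # perfectly anti-correlated with gene 0
V1 = similarity_matrix(profiles)
```

In this toy case genes 0 and 1 receive a similarity of one and genes 0 and 2 a similarity of minus one, so the anti-correlated pair is treated as most dissimilar.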

GO constraint 

As an important source of biological knowledge, the 
Gene Ontology (GO) provides a consistent description 
of genes and gene products by a controlled and struc- 
tured vocabulary, which includes three major categories: 
biological process (BP), molecular function (MF), and 
cellular component (CC). The GO terms are organized 
in the form of a directed acyclic graph (DAG) with two
major semantic relations, "is-a" and "part-of",
where "A is-a B" means A is a subclass of B, and "C
part-of D" means C is always part of D. Generally, simply
identifying the shared GO annotations of gene products
to infer their functional relationship has the following
limitations. First, two quite different GO annotations
can be closely related through their common ancestors
in the DAG and thus have a higher semantic similarity.
Second, the shared GO terms may be too general to
describe the functional association of annotated gene
products. Recently, the GO-based semantic similarity 
measures have been applied to searching semantically 
similar proteins [27], clustering gene expression data 
and assessing cluster validity [19,28,29], developing new 
human regulatory pathway modeling tools [30], validat- 
ing protein interaction data [31], validating functional 
annotation of expression-based clusters [32], and 
enabling the identification of functionally related gene 
products independent of homology [33]. 

The GO-based semantic similarity measures assume that the more information two GO terms share, the
more similar they are. In this paper we adopt a recent GO-based semantic measure proposed by Wang et al.
[19], in which the similarity between two GO terms, $S_{GO}(c_m, c_n)$, is calculated according to the
graph structural information encoded in the GO. This semantic measure between annotated GO terms for
genes has been demonstrated to be better than the classic Resnik's measure in clustering gene products.
If $c$ is a GO term, $C$ is the set of GO terms including term $c$ and all its ancestors, and $E_c$ is
the set of edges connecting all terms in $C$, the S-value of any term $t$ in the graph
$DAG_c = (c, C, E_c)$ related to the term $c$, $S_c(t)$, is defined as

$$S_c(c) = 1,$$
$$S_c(t) = \max\{ w_e \times S_c(t') \mid t' \in \mathrm{children}(t) \}, \quad \text{if } t \ne c,$$

where $w_e$ is the semantic contribution factor for edge $e \in E_c$ linking the term $t$ with its child
term $t'$. Here we use $w_e = 0.8$ for the "is-a" relation and $w_e = 0.6$ for the "part-of" relation as
suggested in [19]. After obtaining the S-values for all terms in $DAG_c$, the semantic value of the
term $c$, $SV(c)$, is

$$SV(c) = \sum_{t \in C} S_c(t).$$

Given two GO terms $c_1$ and $c_2$ as well as their graphs $DAG_{c_1} = (c_1, C_1, E_{c_1})$ and
$DAG_{c_2} = (c_2, C_2, E_{c_2})$, the semantic similarity $S_{GO}(c_1, c_2)$ is

$$S_{GO}(c_1, c_2) = \frac{\sum_{t \in C_1 \cap C_2} \left( S_{c_1}(t) + S_{c_2}(t) \right)}{SV(c_1) + SV(c_2)},$$

where $S_{c_1}(t)$ is the S-value of GO term $t$ related to term $c_1$, and $S_{c_2}(t)$ is the S-value
of GO term $t$ related to term $c_2$. One gene may be annotated by many GO terms. Given two genes
annotated by several GO terms, $GO_i = \{c_{i1}, \ldots, c_{im}, \ldots, c_{iM}\}$ and
$GO_{i'} = \{c_{i'1}, \ldots, c_{i'n}, \ldots, c_{i'N}\}$, the functional similarity between the genes is

$$v^2_{ii'} = \left[ \sum_{1 \le m \le M} \max_{1 \le n \le N} S_{GO}(c_{im}, c_{i'n}) + \sum_{1 \le n \le N} \max_{1 \le m \le M} S_{GO}(c_{i'n}, c_{im}) \right] \Big/ (M + N).$$

Note that the functional similarity $v^2_{ii'}$ between two GO term sets $GO_i$ and $GO_{i'}$ considers
the hierarchical structure of GO terms $c$ based on the S-value. Because the GO contains three main
vocabularies, BP, MF and CC, the GO similarity value between genes can be calculated in a joint manner as

$$v^2_{ii'} = [BPsim^2 + MFsim^2 + CCsim^2] / 3,$$

where BPsim, MFsim and CCsim denote the similarity values $v^2_{ii'}$ of the corresponding GO terms
within the same type. The similarity value $v^2_{ii'} \in [0, 1]$, where a higher value corresponds to a
higher similarity. As a result, the GO-based semantic similarity $v^2_{ii'}$ constrains the second
clustering solution set $C_2 = \{X_g\}$, which contains many locally optimal clustering solutions
satisfying $v^2_{ii'}$.
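
The GO-based similarity described above can be sketched in a few lines of Python. The toy ontology `TOY_PARENTS`, the term names and all function names are hypothetical and serve only to make the example self-contained; a real application would load the GO DAG and the yeast annotations instead. Term lists are assumed to be non-empty.

```python
# Semantic contribution factors suggested in [19].
EDGE_WEIGHT = {"is_a": 0.8, "part_of": 0.6}

# Hypothetical toy ontology: child term -> [(parent term, relation)].
TOY_PARENTS = {
    "root": [],
    "A": [("root", "is_a")],
    "B": [("root", "is_a")],
    "C": [("A", "is_a"), ("B", "part_of")],
}


def s_values(term, parents=TOY_PARENTS):
    """Return {t: S_term(t)} for the term itself and all its ancestors."""
    s = {term: 1.0}
    stack = [term]
    while stack:
        child = stack.pop()
        for parent, relation in parents[child]:
            candidate = EDGE_WEIGHT[relation] * s[child]
            if candidate > s.get(parent, 0.0):  # keep the maximum contribution
                s[parent] = candidate
                stack.append(parent)            # re-propagate the improved value
    return s


def term_similarity(c1, c2, parents=TOY_PARENTS):
    """S_GO(c1, c2): shared S-values divided by SV(c1) + SV(c2)."""
    s1, s2 = s_values(c1, parents), s_values(c2, parents)
    shared = set(s1) & set(s2)
    numerator = sum(s1[t] + s2[t] for t in shared)
    return numerator / (sum(s1.values()) + sum(s2.values()))


def gene_similarity(terms_a, terms_b, parents=TOY_PARENTS):
    """Best-match average over the two (non-empty) GO term lists."""
    best_a = sum(max(term_similarity(a, b, parents) for b in terms_b)
                 for a in terms_a)
    best_b = sum(max(term_similarity(b, a, parents) for a in terms_a)
                 for b in terms_b)
    return (best_a + best_b) / (len(terms_a) + len(terms_b))


def joint_similarity(bp_sim, mf_sim, cc_sim):
    """Combine the BP, MF and CC similarities as in the joint formula."""
    return (bp_sim ** 2 + mf_sim ** 2 + cc_sim ** 2) / 3.0


sim_A_C = term_similarity("A", "C")
sim_genes = gene_similarity(["A", "B"], ["C"])
```

In the toy ontology, term C reaches the root through two paths, and the S-value of the root is the larger of the two path products (0.8 x 0.8 rather than 0.8 x 0.6), which is exactly the role of the max operator in the definition of $S_c(t)$.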

Generalized projections 

Although the gene expression and GO-based semantic similarities may achieve clustering solutions with a
high correlation, there is still a large amount of complementary information between their final
clustering results [34]. The gene expression and GO constrained solution sets $C_1 = \{X_e\}$ and
$C_2 = \{X_g\}$ may not contain a single globally optimal solution, and even if they contain such a
solution, we are unlikely to find it since the optimization procedures are highly nonlinear. So, we
consider $C_1$ and $C_2$ as sets of all locally optimal solutions under different constraints. When both
constraints are satisfied, we eliminate many unreasonable locally optimal solutions and obtain an
improved clustering performance. Our objective is to find the biologically consistent clustering
solution $X^\dagger \in C_1 \cap C_2$ using the POCS procedure [17]. Note that directly adding the two
constraints $v^1_{ii'}$ and $v^2_{ii'}$ based on the weight $w \in [0, 1]$, i.e.,
$(1 - w) v^1_{ii'} + w v^2_{ii'}$, to produce a new constraint for gene clustering is not suitable
because the constraints are of different natures. In contrast, the POCS framework decomposes the
optimization procedure into different projections and solves the problem efficiently.

input: $X_0$, $P_n$, $w_n$, $1 \le n \le N$, $M$.
output: $X_M$.
begin
    for $m \leftarrow 1$ to $M$ do
        $X_m \leftarrow \sum_{n=1}^{N} w_n (P_n X_{m-1})$;    // $P_n X_{m-1}$ is described in Algorithm 2.
    end
end

Algorithm 1: The simultaneous projection.

Within the POCS framework [17], each constraint on the solution is formulated as a corresponding closed
convex set, $C_n$, $1 \le n \le N$, in the Hilbert space $H$. The optimal solution $X^\dagger$ is
included in the intersection set $C_0$ of all convex sets $C_n$,

$$X^\dagger \in C_0 = \bigcap_{n=1}^{N} C_n. \qquad (1)$$

If $C_0$ is nonempty (Figure 1A), the successive projections onto the convex sets,

$$X_m = P_N P_{N-1} \cdots P_n \cdots P_2 P_1 X_{m-1}, \qquad (2)$$

will converge to a consistent solution in $C_0$ for any random initial value $X_0$, where $X_m$,
$1 \le m \le M$, is the solution at the $m$th iteration. Eq. (2) shows that the current solution
$X_{m-1}$ is projected onto each set or constraint $C_n$, $1 \le n \le N$, through the projector $P_n$
successively in order to find the next better solution $X_m$ until it converges to the consistent
solution $X^\dagger$ in the intersection of all sets. Figure 1A shows the projection process for the
consistent problem in Eq. (2), where the thick black dot represents a consistent solution in the
intersection of two sets $C_1$ and $C_2$ for the gene expression and GO constraints, respectively. The
generalized projector $P_n$ transforms $X_{m-1}$ into a solution $\tilde{X}$ within the set $C_n$ that
minimizes the distance between $X_{m-1}$ and $\tilde{X}$,

$$P_n X_{m-1} = \arg\min_{\tilde{X} \in C_n} \| X_{m-1} - \tilde{X} \|, \qquad (3)$$



where $\|\cdot\|$ denotes the norm in the Hilbert space $H$. Indeed, Eq. (3) indicates that we need to
transform the current clustering solution $X_{m-1}$ into a more suitable clustering solution
$\tilde{X}$ based on the similarity or distance matrix $v^n_{ii'}$ for the set $C_n$. If $C_0$ is empty
(Figure 1B), the POCS algorithm uses simultaneous projections,

$$X_m = \sum_{n=1}^{N} w_n (P_n X_{m-1}), \qquad (4)$$

where $w_n$ is the weight on the projections satisfying $\sum_{n=1}^{N} w_n = 1$ and $w_n > 0$ for all
$n$. The simultaneous projections converge weakly to a solution such that a weighted set distance
function is minimized. Note that the simultaneous projections only linearly combine the solutions
projected onto all constraint sets, which is more reasonable than the strategy that linearly combines
constraints and then finds a solution under the new constraint. Figure 1B shows the simultaneous
projections for the inconsistent problem in Eq. (4), where the thick black dot is an approximately best
solution minimizing the weighted set distance from the gene expression constraint $C_1$ and the GO
constraint $C_2$.

Figure 1 (A) The consistent problem in Eq. (2), where the intersection set $C_0$ is nonempty. The circle is the initial solution. The thick black
point is the consistent solution in the intersection of two sets for gene expression and GO constraints, respectively. POCS ensures that the initial
solution will converge to the consistent solution after enough projections represented by the arrows. (B) The inconsistent problem in Eq. (4),
where the intersection set $C_0$ is empty. After enough simultaneous projections represented by the arrows, the thick black dot is the approximate
solution such that a weighted set distance from gene expression and GO constraints is minimized.

In practice, both $C_1$ and $C_2$ are often nonconvex. A set is convex if and only if
$\lambda X_a + (1 - \lambda) X_b$ is in the set whenever $X_a$ and $X_b$ are in the set, for
$0 \le \lambda \le 1$. The constraint sets contain many locally optimal clustering "solutions", and the
interpolation of the solutions, i.e., the weighted sum $\lambda X_a + (1 - \lambda) X_b$, has no
mathematical meaning. Thus, we cannot use the classic POCS procedure (2). Nevertheless, we can still
use the generalized projections (3) to solve the problem within the POCS framework ([17], Chapter 5),
which do not require the sets to be convex. In practice it is difficult to minimize the distance
functions (3) under both constraints at the same time, so we do it iteratively based on generalized
projections. The generalized projector iteratively minimizes the distance function (3), and terminates
if the distance cannot decrease in the next step. From the regularization point of view, the solution
is regularized under different constraints simultaneously, and the final solution is a linear
combination of each regularized solution in Eq. (4). The simultaneous projection weights $w_n$ can be
fixed empirically according to prior knowledge. To summarize, Algorithm 1 shows the simultaneous
projection algorithm.
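
To make the simultaneous projection in Eq. (4) concrete, the following Python sketch runs it on a deliberately simple toy problem. The scalar "solutions" and the interval projectors are assumptions chosen for illustration; in the method itself the projectors act on clustering matrices under the gene expression and GO constraints.

```python
def simultaneous_projection(x0, projectors, weights, n_iter):
    """Iterate x_m = sum_n w_n * P_n(x_{m-1}) for n_iter steps."""
    x = x0
    for _ in range(n_iter):
        x = sum(w * P(x) for P, w in zip(projectors, weights))
    return x


def interval_projector(lo, hi):
    """Projector onto the closed interval [lo, hi] (a toy convex set)."""
    return lambda x: min(max(x, lo), hi)


# Two disjoint toy constraint sets, so the intersection C_0 is empty.
P1 = interval_projector(0.0, 1.0)   # stands in for the expression constraint
P2 = interval_projector(3.0, 4.0)   # stands in for the GO constraint

x_equal = simultaneous_projection(10.0, [P1, P2], [0.5, 0.5], n_iter=3)
x_go = simultaneous_projection(10.0, [P1, P2], [0.2, 0.8], n_iter=20)
```

With equal weights the iterate settles midway between the two sets, and with a weight of 0.8 on the second projector it settles closer to the second set. This mirrors the inconsistent case in which no solution satisfies both constraints and a weighted compromise is returned instead.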
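input: $X^1 = (x^1_{ik})$, $v^n_{ii'}$, $1 \le i, i' \le I$, $1 \le k \le K$, $J$.
output: $X^J = (x^J_{ik})$, $1 \le i \le I$, $1 \le k \le K$.
begin
    for $j \leftarrow 1$ to $J$ do
        for $i \leftarrow 1$ to $I$ do
            for $k \leftarrow 1$ to $K$ do
                $x^{j+1}_{ik} \propto x^{j}_{ik} \exp(q^{j}_{ik})$;
            end
        end
    end
end

Algorithm 2: The relaxation labeling projector.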

Now we design the generalized projector based on the iterative RL algorithm [20,21,23], which can find
the soft cluster label for each gene under a certain constraint. Given the clustering solution $X$ and
the constraint $v^n_{ii'}$, minimizing (3) is equivalent to maximizing the corresponding gain function,

$$S(X, v^n_{ii'}) = \sum_{i=1}^{I} \sum_{i' \in \partial_i} \exp(v^n_{ii'}) \, \mathbf{x}_i \mathbf{x}_{i'}^{\top}, \qquad (5)$$

where $\partial_i$ is the set of neighbors of the $i$th gene, and the term $\exp(v^n_{ii'})$ increases
with the similarity between two genes according to the constraint $v^n_{ii'}$. The neighborhood system
$\partial_i$ is defined as the ten nearest genes $i'$ with the top similarity values $v^n_{ii'}$. The
term $\exp(v^n_{ii'}) \mathbf{x}_i \mathbf{x}_{i'}^{\top}$ encourages genes with a high similarity value
$v^n_{ii'}$ to also have similar soft cluster labeling configurations. The RL algorithm iteratively
updates the initial $X^1$ by the gradient $q^j_{ik}$ of the gain function (5) until $j$ reaches the
fixed maximum number $J$, as shown in Algorithm 2. The value of $J$ is determined experimentally to
ensure that the gain function is maximized. That is, after $J$ iterations, the RL algorithm converges to
a local maximum of the gain function in terms of $X^J$. Meanwhile, the distance function (3) is also
minimized by $X^J$, where $X^J$ is equivalent to $\tilde{X}$ in (3). Algorithm 2 shows the projection of
$X^1$ satisfying one constraint $v^n_{ii'}$. Note that $J$ is the number of iterations of the RL-based
projector in Algorithm 2, while $M$ is the number of iterations of the simultaneous projection in
Algorithm 1. The RL-based projector is a fast algorithm, and in practice $J = 5$ is enough.
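
A relaxation-labeling style projector can be sketched as follows. This is an illustration under stated assumptions rather than the exact projector: it assumes a multiplicative update $x_{ik} \propto x_{ik} \exp(q_{ik})$ followed by row normalization, and it takes $q_{ik} = \sum_{i' \in \partial_i} \exp(v_{ii'}) x_{i'k}$ as the gradient of the gain function (5) with respect to $x_{ik}$, counting only the neighbors of gene $i$. The function name, the toy similarity matrix and the toy initial solution are hypothetical.

```python
import math


def rl_projector(X, V, n_neighbors=10, n_iter=5):
    """Project the soft solution X under one similarity constraint V."""
    n_genes, n_clusters = len(X), len(X[0])
    # Neighborhood system: the most similar genes, excluding the gene itself.
    neighbors = []
    for i in range(n_genes):
        others = [j for j in range(n_genes) if j != i]
        others.sort(key=lambda j: V[i][j], reverse=True)
        neighbors.append(others[:n_neighbors])
    for _ in range(n_iter):
        updated = []
        for i in range(n_genes):
            # Assumed gradient of the gain function for gene i, cluster k.
            q = [sum(math.exp(V[i][j]) * X[j][k] for j in neighbors[i])
                 for k in range(n_clusters)]
            q_max = max(q)  # subtract the maximum for numerical stability
            raw = [X[i][k] * math.exp(q[k] - q_max) for k in range(n_clusters)]
            total = sum(raw)
            updated.append([r / total for r in raw])  # rows sum to one again
        X = updated  # all genes are updated from the previous iterate
    return X


# Toy constraint: genes 0 and 1 are similar, genes 2 and 3 are similar.
V = [[1.0, 0.9, -0.5, -0.5],
     [0.9, 1.0, -0.5, -0.5],
     [-0.5, -0.5, 1.0, 0.9],
     [-0.5, -0.5, 0.9, 1.0]]
X1 = [[0.9, 0.1], [0.6, 0.4], [0.4, 0.6], [0.1, 0.9]]
XJ = rl_projector(X1, V, n_neighbors=1, n_iter=5)
```

In this toy run the undecided genes 1 and 2 are pulled toward the cluster of their most similar neighbor, which is the behavior that the gain function rewards.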

Gene log likelihood 

If we have a reference gene clustering solution $Y$, we can calculate the distance between the
predicted clustering solution $X$ and the standard reference $Y$ for performance evaluation. The
reference clustering solution is a matrix, $Y = (y_{iw})$, $1 \le i \le I$, $1 \le w \le W$, where
$y_{iw} = 1$ denotes that the $i$th gene belongs to the $w$th cluster. The number of reference clusters
$W$ may not equal the predicted number of clusters $K$ in most cases. Because a gene may belong to
multiple clusters due to multiple functions, the vector
$\mathbf{y}_i = (y_{i1}, \ldots, y_{iw}, \ldots, y_{iW})$ may contain multiple ones for the $i$th gene.

Based on the hard clustering solution $X^*$, we may quantify the distance between $X^*$ and $Y$ by
normalized mutual information (NMI), which has been widely used to measure the performance of
clustering methods [12,19]. In information theory, the mutual information is defined as a quantity that
measures the amount of information shared between two random variables. If one set of clusters is more
consistent with the other set of clusters, the mutual information between the two sets of cluster
labels becomes larger. Generally, the mutual information is normalized because its range depends on the
size of the given sets of clusters. NMI is calculated as

$$NMI = \frac{\sum_{w} \sum_{k} n_{wk} \ln\left( \frac{I \, n_{wk}}{n_w n_k} \right)}{\sqrt{\left( \sum_{w} n_w \ln \frac{n_w}{I} \right) \left( \sum_{k} n_k \ln \frac{n_k}{I} \right)}},$$

where $I$ is the number of genes, $n_w$ is the number of genes in the $w$th reference cluster, $n_k$ is
the number of genes in the $k$th predicted cluster, and $n_{wk}$ is the number of genes in both the
$w$th reference cluster and the $k$th predicted cluster. If two sets of clusters are identical, NMI
between them reaches the maximum value of one.
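
For hard labels, NMI with the square-root normalization can be computed as in the following sketch. The function name and the toy label vectors are hypothetical, each gene is assumed to carry exactly one reference label and one predicted label, and returning zero when either labeling has a single cluster is a convention chosen here rather than taken from the text.

```python
import math
from collections import Counter


def nmi(ref_labels, pred_labels):
    """Normalized mutual information between two hard labelings."""
    n = len(ref_labels)
    n_w = Counter(ref_labels)                     # reference cluster sizes
    n_k = Counter(pred_labels)                    # predicted cluster sizes
    n_wk = Counter(zip(ref_labels, pred_labels))  # joint counts
    numerator = sum(c * math.log(n * c / (n_w[w] * n_k[k]))
                    for (w, k), c in n_wk.items())
    h_ref = sum(c * math.log(c / n) for c in n_w.values())
    h_pred = sum(c * math.log(c / n) for c in n_k.values())
    denominator = math.sqrt(h_ref * h_pred)
    # Convention (assumption): NMI is 0 if either labeling has one cluster.
    return numerator / denominator if denominator > 0 else 0.0


nmi_identical = nmi([0, 0, 1, 1], ["a", "a", "b", "b"])
nmi_unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

The first call compares two labelings that differ only in the cluster names and gives one, while the second compares two labelings that share no information and gives zero.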

However, NMI cannot be used if one gene may be in multiple clusters. So, we propose a new performance
measure referred to as gene log likelihood (GLL), $\log P(Y|X)$, for gene clustering, which measures the
likelihood of predicting a single gene in the reference cluster $Y$ based on $X$. GLL has a simple
meaning: the $i$th gene in the $w$th reference cluster of $Y$ is predicted with a likelihood
proportional to the product of the likelihood that the $w$th cluster is generated by the $k$th cluster
and the likelihood that the $i$th gene is generated by the $k$th cluster in $X$. Higher values are
better, indicating that the obtained clustering solution $X$ has a higher likelihood of generating the
reference gene clusters $Y$. Specifically, we calculate GLL as follows,

$$\log P(Y|X) = \sum_{w=1}^{W} \sum_{i \in w} \log\left( \mathbf{p}_w \mathbf{x}_i^{\top} \right), \qquad (6)$$

where $\mathbf{x}_i = (x_{i1}, \ldots, x_{ik}, \ldots, x_{iK})$ is the probability distribution over $K$
clusters of the $i$th gene, $i \in w$ denotes the set of all genes in the $w$th reference cluster with
$y_{iw} = 1$, and $\mathbf{p}_w = (p_{w1}, \ldots, p_{wk}, \ldots, p_{wK})$ is the probability
distribution of the $w$th reference cluster over $K$ predicted clusters. Empirically, this probability
$p_{wk}$ can be estimated by

$$\tilde{p}_{wk} = \sum_{i \in w} x_{ik}, \qquad (7)$$

$$p_{wk} = \frac{\tilde{p}_{wk}}{\sum_{k'} \tilde{p}_{wk'}}, \qquad (8)$$

where we assume that the genes are conditionally independent in the generative process. Indeed, this is
a standard performance measure for word clustering in text mining [24], which indicates the empirical
likelihood of predicting a single word in a document.
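
The GLL computation can be sketched as follows. The sketch assumes that the reference distribution $p_{wk}$ is estimated by summing the soft assignments of the genes in the $w$th reference cluster and normalizing over the predicted clusters; the function name, the small floor `eps` that guards against the logarithm of zero, and the toy inputs are all hypothetical.

```python
import math


def gene_log_likelihood(X, ref_clusters, eps=1e-300):
    """GLL of soft solution X given reference clusters (lists of gene indices).

    A gene may appear in several reference clusters.
    """
    n_clusters = len(X[0])
    gll = 0.0
    for members in ref_clusters:
        # Assumed estimate of p_w: accumulate and normalize soft assignments.
        raw = [sum(X[i][k] for i in members) for k in range(n_clusters)]
        total = sum(raw)
        p_w = [r / total for r in raw]
        for i in members:
            likelihood = sum(p_w[k] * X[i][k] for k in range(n_clusters))
            gll += math.log(max(likelihood, eps))  # eps avoids log(0)
    return gll


reference = [[0, 1], [2, 3]]  # two reference clusters over four genes
gll_perfect = gene_log_likelihood([[1, 0], [1, 0], [0, 1], [0, 1]], reference)
gll_uniform = gene_log_likelihood([[0.5, 0.5]] * 4, reference)
```

A solution that reproduces the reference clusters reaches the maximum value of zero, whereas the uninformative uniform solution receives a lower value, so higher GLL indicates closer agreement with the reference.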
Results and Discussion

Datasets 

To calculate the gene expression constraint, we select four microarray time-series datasets [35] that
monitor genome-wide mRNA levels for 6178 yeast Saccharomyces cerevisiae open reading frames
simultaneously using several different methods of synchronization: the alpha, cdc15, cdc28 and elu
datasets. We also add the Hughes dataset [36], widely used in gene clustering [9,10], because it
contains 300 time points and only a small number of missing values. The missing values in the
microarray data are interpolated by the POCS-based reconstruction method [18], which uses multiple
constraints such as synchronization loss. To calculate the GO constraint, the GO (version 20080225) and
annotation (version 1.1384) databases of yeast are downloaded from the GO official website. The yeast
annotation file includes 6345 gene products annotated with 77152 GO terms.

To evaluate MGC methods for gene clustering, we generate two different sets of reference gene clusters
with true cluster labels from KEGG [37] and SGD (Saccharomyces Genome Database,
http://www.yeastgenome.org/), referred to as KEGG clusters [12] and SGD clusters [19], respectively.
The KEGG pathway maps are generally classified into six major categories including metabolism. We use
ten subcategories under the metabolism category as KEGG clusters, which include a total of 531 genes.
Note that a gene can be in more than one cluster. Table 1 lists the KEGG clusters and the number of
genes in the corresponding cluster. We also use the gene annotation and classification information in
yeast biochemical pathways as SGD clusters. There are 142 pathways involving 835 genes, among which
only 26 pathways contain more than 10 genes, where a gene can be in more than one pathway. Table 2
summarizes the list of pathway clusters and the number of genes in the corresponding cluster. We use
two different sets of reference clusters because gene clusters vary with the partitioning criteria. If
the clusters predicted by the POCS-based method are close to both sets of reference clusters, we may
safely conclude that this method is robust for annotating gene functions under different conditions.

Table 1 10 reference gene clusters from KEGG

Cluster name                                  Number of genes
Amino acid metabolism                         197
Carbohydrate metabolism                       189
Metabolism of cofactors and vitamins          47
Energy metabolism                             66
Glycan biosynthesis and metabolism            21
Lipid metabolism                              74
Nucleotide metabolism                         103
Metabolism of other amino acids               50
Metabolism of secondary metabolites           18
Xenobiotics biodegradation and metabolism     19

Comparative results 

The POCS-based MGC method requires two key parameters,
the number of simultaneous projections $M$ and
the weights on projections $w_n$, in Algorithm 1. Because
we have two constraints, the weight for the GO-based
constraint is $w$, and thus the weight for the gene expression
constraint is $1 - w$. Through experiments on the
alpha dataset, we can determine proper M and w for 
desirable gene clustering performance. The parameters 
M and w are adjusted so that we can obtain the desir- 
able result within the POCS framework. It is possible 
that another iterative method can estimate the para- 
meters better. However, in many cases, such a better- 
performing method is a supervised learning procedure 
using reference gene clusters, and can be incorporated 
into the POCS procedure to achieve an even better per- 
formance or robustness. That is, POCS is useful for 
combining information from different sources if we can 
formulate corresponding constraint sets and projections. 

To determine $M$, we randomly initialize the clustering solution and set the weight $w = 0.5$. Figure 2
shows the GLL values on the KEGG and SGD reference clusters when 10 projections are used. For different
numbers of clusters $K = 10, 15, 20, 25$, we see that the GLL values do not increase significantly
after two or three projections. So, we believe that $M = 3$ is enough to produce desirable clustering
results in this task. From this experiment, we also see that Algorithm 1 converges quickly after a few
projections. Then, we fix $M = 3$ and tune the weight $w \in [0, 1]$. Because only $M = 3$ projections
are needed in practice, POCS does not increase the computational cost very much, which makes this
algorithm attractive for combining more constraints for gene clustering.

Figure 3 shows the GLL values on the KEGG and 
SGD reference clusters by increasing the weight at the 
step 0.1. We observe that the performance highly 
depends on different projection weights. If we use 
KEGG reference clusters, we find that weight w = 0.7 
can produce higher GLL value on average. The gene 
expression constraint alone w = 0 does not ensure the 
best clustering result, while the GO constraint alone w 
= 1 does not ensure the best clustering result either. We 
see that the GO constraint can produce more reliable 
clustering result than the gene expression constraint, 
because the GO annotation is based on prior knowledge 
of biologists more reliable than gene expression data. 
Furthermore, we often assume that anti-correlated genes 
are not within the same cluster, but in some cases this 
Table 2. 26 reference gene clusters from yeast biochemical pathways

Cluster name                                                        Number of genes
TCA cycle, aerobic respiration                                      24
de novo biosynthesis of purine nucleotides                          32
de novo biosynthesis of pyrimidine deoxyribonucleotides             15
de novo biosynthesis of pyrimidine ribonucleotides                  12
ergosterol biosynthesis                                             15
fatty acid biosynthesis, initial steps                              12
fatty acid oxidation pathway                                        11
folate biosynthesis                                                 24
folate interconversions                                             17
folate polyglutamylation                                            13
folate transformations                                              16
gluconeogenesis                                                     17
glycolysis                                                          14
glyoxylate cycle                                                    12
inositol phosphate biosynthesis                                     14
isoleucine degradation                                              13
lipid-linked oligosaccharide biosynthesis                           15
pantothenate and coenzyme A biosynthesis                            11
phenylalanine degradation                                           12
phosphatidylinositol phosphate biosynthesis                         21
protein modifications                                               12
salvage pathways of adenine, hypoxanthine, and their nucleosides    11
sphingolipid metabolism                                             23
superpathway of glucose fermentation                                14
tryptophan degradation                                              12
valine degradation                                                  11



Figure 2. GLL of the alpha dataset on the KEGG and SGD reference clusters when 10 projections are used (panels: K = 10, K = 15, K = 20, K = 25; horizontal axis: number of projections).
assumption is not true. However, as the weight w
increases, the performance does not always increase:
w = 0.5 produces a local minimum of the GLL value,
after which the GLL value continues to increase to the
next local maximum. The SGD reference clusters
reconfirm that the GO-based constraint is more reliable;
there, the best clustering performance on average often
occurs at w = 0.9. Therefore, we adopt the weight
w = 0.8, which lies between the values favoured by the
KEGG (w = 0.7) and SGD (w = 0.9) reference clusters,
for the simultaneous projection in all our experiments.

As far as Figure 3 is concerned, one major reason why
GO information is more reliable for clustering is that
the reference gene clusters from KEGG and SGD
(Tables 1 and 2) are partly correlated with GO
annotations. Therefore, we need to delete a certain
fraction of the GO annotations when performing
clustering, and use only the gene expression constraint
to predict the functions of the affected genes, which are
then compared with the reference gene clusters. In this
paper, we adopt the cross-validation procedure [10] to
validate the POCS-based MGC method. More
specifically, we perform five-fold cross-validation by
deleting 20% of the GO constraints from the datasets in
turn. We examine whether the POCS-based MGC
method can predict the functions of those 20% of genes
without GO constraints, as compared with the reference
KEGG and SGD gene clusters. We report the average
prediction performance over the five folds.
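The fold construction can be sketched as follows. This is a minimal illustration, assuming the GO constraint is stored per gene in a dictionary; the gene identifiers, the `go_constraints` mapping, and the helper `five_fold_masks` are hypothetical and not taken from the paper. In each fold, 20% of the genes lose their GO constraint, and only those held-out genes would be scored against the reference clusters.

```python
import random

def five_fold_masks(genes, n_folds=5, seed=0):
    """Split genes into n_folds disjoint held-out groups of near-equal size."""
    shuffled = list(genes)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n_folds] for i in range(n_folds)]

# Hypothetical toy input: gene identifier -> set of GO terms.
go_constraints = {f"gene{i:02d}": {f"GO:{i % 4}"} for i in range(20)}

folds = five_fold_masks(go_constraints)
for k, held_out in enumerate(folds, start=1):
    held = set(held_out)
    # Training view: GO constraints with the held-out 20% deleted.
    masked = {g: terms for g, terms in go_constraints.items() if g not in held}
    # A clustering method would be run here on `masked` plus the expression
    # data, and then evaluated only on the genes in `held`.
    print(f"fold {k}: {len(masked)} genes keep GO constraints, {len(held)} held out")
```

Fixing the random seed keeps the folds reproducible, so that every compared method sees the same held-out genes.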

After fixing M = 3 and w = 0.8, we compare our
POCS-based MGC method with three state-of-the-art
MGC methods: k-medoids [10], ICM [12] and FCM
[13]. Both k-medoids and ICM first linearly combine
the two constraints (gene expression and GO), and then
use the k-medoids and ICM algorithms, respectively, to
partition the genes into different clusters. We empirically
set the linear combination weight of the GO constraint
to w = 0.9 for k-medoids, which produces desirable
clustering results in terms of GLL on average. For the
ICM algorithm [12], we choose the best recommended
parameter w = 0.2, which is biased toward the gene
expression constraint. FCM, on the other hand, uses GO
annotations to initialize X_0, and then uses both the
initial X_0 and the gene expression values to update the
solution until it converges to a new clustering solution
X_M. We use the best suggested weight w = 0.8 for
FCM [13], which is biased toward the GO constraint for
soft clustering.
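As a rough illustration of what linearly combining two constraints can look like, the sketch below forms a convex combination of two pairwise dissimilarity matrices. It is an assumption made for illustration only: the exact combination rules used in [10] and [12] may differ, and the function name and the 3-gene matrices are hypothetical.

```python
def combine_dissimilarities(d_expr, d_go, w):
    """Return (1 - w) * d_expr + w * d_go, element by element."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return [[(1 - w) * e + w * g for e, g in zip(row_e, row_g)]
            for row_e, row_g in zip(d_expr, d_go)]

# Hypothetical 3-gene example (symmetric matrices with zero diagonal).
d_expr = [[0.0, 0.2, 0.9],
          [0.2, 0.0, 0.4],
          [0.9, 0.4, 0.0]]
d_go = [[0.0, 0.8, 0.1],
        [0.8, 0.0, 0.7],
        [0.1, 0.7, 0.0]]

# w = 0.9 weights the GO-based dissimilarity heavily.
combined = combine_dissimilarities(d_expr, d_go, w=0.9)
for row in combined:
    print([round(v, 2) for v in row])
```

With w = 0 the combined matrix reduces to the expression-based dissimilarity alone, and with w = 1 to the GO-based one, which mirrors the two extreme settings of the weight discussed above.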

Tables 3, 4, 5 and 6 show the average clustering
performance and standard deviation in terms of GLL and
NMI, based on the soft clustering solution X and the hard
clustering solution X a , respectively. POCS produces
the highest GLL value among all MGC methods,
which means that its soft clustering solution is the most
likely to generate both the KEGG and SGD reference
clusters. The k-medoids algorithm performs the worst,
partly because it easily falls into a locally optimal
clustering solution. ICM uses an iterative procedure to
find a better clustering solution under the combined
constraint, but it is biased toward the unreliable gene
expression constraint. FCM performs slightly better than
ICM partly because it is
Table 3. Five-fold cross-validation of the GLL values on KEGG clusters

Datasets    POCS          k-medoids [10]    ICM [12]      FCM [13]
(a) K = 10
alpha       -198 ± 8      -354 ± 9          -253 ± 14     -238 ± 14
cdc15       [illegible]   [illegible]       [illegible]   [illegible]
cdc28       -200 ± 9      -340 ± 20         -265 ± 8      -244 ± 10
elu         -199 ± 6      -355 ± 14         -253 ± 10     -228 ± 10
Hughes      -191 ± 4      -329 ± 17         -212 ± 5      -196 ± 9
(b) K = 15
alpha       -184 ± 6      -415 ± 32         -282 ± 12     -262 ± 16
cdc15       [illegible]   [illegible]       [illegible]   [illegible]
cdc28       -189 ± 9      -424 ± 18         -294 ± 9      -271 ± 11
elu         -187 ± 9      -410 ± 35         -297 ± 11     -291 ± 13
Hughes      -180 ± 8      -401 ± 9          -262 ± 6      -234 ± 10
(c) K = 20
alpha       -243 ± 10     -461 ± 27         -288 ± 11     -254 ± 22
cdc15       -225 ± 10     -460 ± 26         -271 ± 8      -246 ± 14
cdc28       -248 ± 8      -478 ± 33         -301 ± 9      -270 ± 10
elu         -259 ± 10     -476 ± 35         -304 ± 7      -286 ± 13
Hughes      -222 ± 6      -455 ± 34         -276 ± 9      -239 ± 13
(d) K = 25
alpha       -304 ± 13     -494 ± 26         -363 ± 18     -328 ± 13
cdc15       -302 ± 6      -491 ± 41         -369 ± 12     -331 ± 17
cdc28       -298 ± 8      -444 ± 37         -363 ± 14     -326 ± 17
elu         -321 ± 7      -535 ± 23         -378 ± 9      -342 ± 13
Hughes      -284 ± 7      -478 ± 19         -351 ± 11     -319 ± 11


biased toward the more reliable GO constraint. Compared
with FCM, POCS increases the GLL value significantly,
by around 15%, on both the KEGG and SGD reference
clusters. Another observation is that the Hughes dataset
has the highest GLL value, partly because it contains
much longer gene expression profiles than the alpha,
cdc15, cdc28 and elu datasets; longer gene expression
profiles are more reliable for gene clustering. The NMI
values are consistent with the GLL values: if the soft
clustering solution has a higher GLL value, the
corresponding hard clustering solution obtained by the
winner-take-all strategy also has a higher NMI value.
Thus, GLL is a suitable performance measure for the
soft clustering solution, with a higher GLL value
corresponding to a better soft clustering solution.
However, the GLL value varies much more than the NMI
value, mainly because the soft clustering solution space
is larger than that of hard clustering. In some cases, the
difference in NMI values between POCS and FCM is
small, so we need to examine its statistical significance.
Table 7 shows the p-values of the pairwise t-test [38]
over all five microarray datasets, which indicate that the
NMI value of POCS is higher than the corresponding
FCM result with a statistical significance of more than
99% for all datasets.
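The significance test can be illustrated with a short sketch. It assumes that the test pairs the POCS and FCM mean NMI values dataset by dataset, and it uses the K = 10 NMI means on KEGG clusters listed in Table 4; the function name is hypothetical, and the critical value 4.604 is the standard two-sided 1% point of Student's t distribution with 4 degrees of freedom.

```python
import math
import statistics

# Mean NMI values on KEGG clusters for K = 10 (Table 4, panel a), in the
# order alpha, cdc15, cdc28, elu, Hughes.
pocs = [0.287, 0.282, 0.267, 0.263, 0.289]
fcm = [0.265, 0.268, 0.236, 0.240, 0.271]

def paired_t_statistic(a, b):
    """Return the paired t statistic of a[i] - b[i] and its degrees of freedom."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return mean / (sd / math.sqrt(n)), n - 1

t_stat, dof = paired_t_statistic(pocs, fcm)

# Two-sided critical value of Student's t with 4 degrees of freedom at the 1% level.
T_CRIT_99_DF4 = 4.604
print(f"t = {t_stat:.2f} with {dof} degrees of freedom; "
      f"significant at the 99% level: {abs(t_stat) > T_CRIT_99_DF4}")
```

Pairing by dataset removes the variation between datasets from the comparison, which is why a consistent but small NMI gain can still be statistically significant.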

To further confirm the effectiveness of the POCS-based
MGC method, we show two clustering examples. First,
the gene YPR145W is involved in two KEGG pathways,
"Amino acid metabolism" and "Energy metabolism", in
Table 1. All other MGC algorithms misclassify this gene
by placing it in a single cluster, but our POCS algorithm
successfully assigns it to two clusters with probabilities
0.7 and 0.3. This example confirms the effectiveness of
our method for identifying genes with multiple functions.
Second, we examine the gene YJL052W, which is
involved in two SGD pathways, "glycolysis" and
"gluconeogenesis", in Table 2. We compute the p-values
between each gene function in GO and the cluster (alpha
dataset, K = 10) containing the gene YJL052W using the
Gene Ontology Term Finder
(http://db.yeastgenome.org/cgi-bin/GO/goTermFinder.pl).
We then rank the gene functions according to their
p-values, and the top function is assigned to the gene
cluster. The top function is "glycolysis" with a p-value
of 3.12e-41, which is consistent with one of the SGD
pathways in which YJL052W is involved. This example
further confirms that the discovered clusters indeed
reflect true biological functions in terms of pathways.
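An enrichment p-value of this kind is commonly computed as a hypergeometric tail probability. The sketch below assumes that test; it is not a description of the Term Finder's internals, it omits any multiple-testing correction, and the counts and function name are hypothetical rather than taken from the paper.

```python
from math import comb

def hypergeom_tail(N, M, n, k):
    """P(X >= k) when n genes are drawn from N genes, M of which carry the term."""
    upper = min(M, n)
    total = comb(N, n)
    favourable = sum(comb(M, i) * comb(N - M, n - i) for i in range(k, upper + 1))
    return favourable / total

# Hypothetical counts: a genome of 6000 genes, 40 of them annotated with the
# term, and a cluster of 50 genes that contains 12 annotated genes.
p_value = hypergeom_tail(N=6000, M=40, n=50, k=12)
print(f"enrichment p-value = {p_value:.3e}")
```

Ranking the GO terms by such p-values and taking the smallest one corresponds to the step of assigning the top function to the cluster.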

Conclusion 

This paper presents a novel MGC method within the
generalized POCS framework, which successfully
combines two constraints of different natures for gene
clustering. In addition, we propose the GLL to measure
soft clustering performance. Experimental results of
five-fold cross-validation on different microarray
datasets show that the POCS-based MGC method is
competitive with or superior to other state-of-the-art
MGC methods on the KEGG and SGD reference gene
clusters. In the future, we aim to incorporate more
constraints, such as DNA sequence features and gene
network structures, to further improve gene clustering
performance. For example, the structural profiles of
DNA sequences play important roles in key genetic
processes such as transcription [39], replication [40],
protein-DNA recognition [41], and tissue specificity [42].
We may use the similarity between structural profiles of
DNA sequences as a new constraint for gene clustering.
We may also develop more efficient supervised learning
strategies to automatically determine the weights of
simultaneous projections in Algorithm 1. For example,
we may choose decision trees [43] or ensemble learning
methods [44] to learn the weights of different
constraints from training data, and apply these weights
to clustering unknown genes for function prediction.
Table 4. Five-fold cross-validation of the NMI values on KEGG clusters

Datasets    POCS             k-medoids [10]    ICM [12]         FCM [13]
(a) K = 10
alpha       0.287 ± 0.008    0.234 ± 0.007     0.251 ± 0.005    0.265 ± 0.005
cdc15       0.282 ± 0.003    0.222 ± 0.009     0.259 ± 0.002    0.268 ± 0.009
cdc28       0.267 ± 0.009    0.226 ± 0.005     0.209 ± 0.003    0.236 ± 0.003
elu         0.263 ± 0.006    0.219 ± 0.004     0.215 ± 0.001    0.240 ± 0.006
Hughes      0.289 ± 0.006    0.238 ± 0.007     0.254 ± 0.007    0.271 ± 0.005
(b) K = 15
alpha       0.310 ± 0.009    0.255 ± 0.010     0.260 ± 0.010    0.283 ± 0.007
cdc15       0.305 ± 0.004    0.266 ± 0.004     0.278 ± 0.012    0.281 ± 0.001
cdc28       0.301 ± 0.001    0.266 ± 0.009     0.263 ± 0.008    0.279 ± 0.001
elu         0.292 ± 0.007    0.234 ± 0.002     0.244 ± 0.006    0.264 ± 0.009
Hughes      0.322 ± 0.003    0.286 ± 0.001     0.285 ± 0.007    0.303 ± 0.008
(c) K = 20
alpha       0.382 ± 0.005    0.331 ± 0.001     0.335 ± 0.007    0.361 ± 0.004
cdc15       0.384 ± 0.002    0.339 ± 0.004     0.341 ± 0.003    0.367 ± 0.004
cdc28       0.361 ± 0.003    0.322 ± 0.001     0.336 ± 0.009    0.350 ± 0.007
elu         0.354 ± 0.007    0.311 ± 0.002     0.325 ± 0.003    0.342 ± 0.003
Hughes      0.396 ± 0.009    0.326 ± 0.003     0.356 ± 0.005    0.376 ± 0.009
(d) K = 25
alpha       0.348 ± 0.008    0.307 ± 0.008     0.321 ± 0.008    0.339 ± 0.007
cdc15       0.353 ± 0.005    0.312 ± 0.002     0.309 ± 0.009    0.330 ± 0.009
cdc28       0.351 ± 0.003    0.316 ± 0.009     0.302 ± 0.009    0.336 ± 0.006
elu         0.338 ± 0.007    0.290 ± 0.007     0.308 ± 0.002    0.325 ± 0.005
Hughes      0.358 ± 0.007    0.320 ± 0.004     0.323 ± 0.005    0.343 ± 0.004


Table 5. Five-fold cross-validation of the GLL values on SGD clusters

Datasets    POCS       k-medoids [10]    ICM [12]    FCM [13]
(a) K = 10
alpha       -49 ± 3    -146 ± 8          -66 ± 2     -62 ± 2
cdc15       -47 ± 1    -148 ± 13         -67 ± 3     -61 ± 3
cdc28       -50 ± 2    -154 ± 14         -79 ± 3     -64 ± 3
elu         -52 ± 3    -152 ± 9          -69 ± 4     -61 ± 3
Hughes      -43 ± 3    -143 ± 11         -65 ± 4     -55 ± 3
(b) K = 15
alpha       -42 ± 3    -171 ± 4          -69 ± 1     -64 ± 2
cdc15       -40 ± 1    -172 ± 4          -78 ± 4     -59 ± 3
cdc28       -43 ± 3    -169 ± 10         -79 ± 3     -64 ± 4
elu         -43 ± 1    -170 ± 13         -80 ± 3     -62 ± 3
Hughes      -39 ± 3    -167 ± 14         -62 ± 4     -53 ± 4
(c) K = 20
alpha       -71 ± 3    -190 ± 8          -86 ± 2     -82 ± 2
cdc15       -74 ± 3    -194 ± 16         -89 ± 6     -79 ± 5
cdc28       -67 ± 3    -188 ± 14         -87 ± 2     -71 ± 2
elu         -82 ± 6    -197 ± 6          -89 ± 2     -88 ± 2
Hughes      -64 ± 4    -182 ± 11         -81 ± 5     -70 ± 4
(d) K = 25
alpha       -64 ± 2    -216 ± 9          -91 ± 2     -78 ± 2
cdc15       -65 ± 4    -213 ± 17         -89 ± 6     -80 ± 6
cdc28       -62 ± 3    -216 ± 11         -89 ± 2     -77 ± 3
elu         -72 ± 3    -219 ± 14         -93 ± 2     -85 ± 4
Hughes      -63 ± 5    -204 ± 8          -84 ± 5     -67 ± 4
Table 6. Five-fold cross-validation of the NMI values on SGD clusters

Datasets    POCS             k-medoids [10]    ICM [12]         FCM [13]
(a) K = 10
alpha       0.438 ± 0.008    0.383 ± 0.002     0.404 ± 0.002    0.408 ± 0.003
cdc15       0.462 ± 0.001    0.389 ± 0.004     0.422 ± 0.003    0.429 ± 0.004
cdc28       0.428 ± 0.005    0.387 ± 0.001     0.400 ± 0.002    0.411 ± 0.004
elu         0.432 ± 0.006    0.410 ± 0.004     0.411 ± 0.003    0.412 ± 0.004
Hughes      0.467 ± 0.004    0.414 ± 0.009     0.434 ± 0.003    0.439 ± 0.003
(b) K = 15
alpha       0.533 ± 0.003    0.471 ± 0.006     0.507 ± 0.004    0.517 ± 0.004
cdc15       0.572 ± 0.002    0.507 ± 0.005     0.528 ± 0.005    0.540 ± 0.003
cdc28       0.552 ± 0.001    0.488 ± 0.004     0.524 ± 0.005    0.543 ± 0.003
elu         0.536 ± 0.008    0.466 ± 0.004     0.514 ± 0.003    0.525 ± 0.003
Hughes      0.566 ± 0.001    0.513 ± 0.007     0.549 ± 0.003    0.546 ± 0.005
(c) K = 20
alpha       0.607 ± 0.003    0.551 ± 0.003     0.579 ± 0.004    0.583 ± 0.004
cdc15       0.613 ± 0.001    0.543 ± 0.005     0.580 ± 0.003    0.587 ± 0.004
cdc28       0.598 ± 0.002    0.551 ± 0.003     0.587 ± 0.004    0.586 ± 0.005
elu         0.593 ± 0.001    0.539 ± 0.006     0.567 ± 0.004    0.564 ± 0.003
Hughes      0.638 ± 0.004    0.576 ± 0.003     0.586 ± 0.005    0.591 ± 0.003
(d) K = 25
alpha       0.649 ± 0.002    0.586 ± 0.004     0.636 ± 0.004    0.634 ± 0.006
cdc15       0.648 ± 0.006    0.594 ± 0.005     0.621 ± 0.004    0.620 ± 0.005
cdc28       0.661 ± 0.003    0.607 ± 0.005     0.630 ± 0.005    0.637 ± 0.006
elu         0.637 ± 0.004    0.607 ± 0.008     0.619 ± 0.006    0.621 ± 0.005
Hughes      0.667 ± 0.003    0.617 ± 0.009     0.637 ± 0.004    0.646 ± 0.005



Table 7. P-values of the pairwise t-test between POCS and FCM

Number of clusters K    KEGG       SGD
10                      1.60e-3    1.10e-3
15                      1.29e-4    1.25e-2
20                      1.30e-3    8.10e-3
25                      2.80e-3    1.00e-3



Acknowledgements 

Great thanks are due to Xiao-Qin Cao and Xiao-Yu Zhao for their assistance
in code implementation. This work is supported by the Hong Kong Research
Grant Council (Project CityU 122607). This work is also supported by the
National Natural Science Foundation of China (No. 60903076) and the
Shanghai Committee of Science and Technology, China (No. 08DZ2271800
and 09DZ2272800).

Author details 

1 School of Computer Science and Technology, Soochow University, Suzhou
215006, China. 2 Department of Computer Science, Hong Kong Baptist
University, Kowloon Tong, Hong Kong. 3 School of Computer Science and
Technology, Fudan University, Shanghai 200433, China. 4 Shanghai Key Lab of
Intelligent Information Processing, Fudan University, Shanghai 200433, China.
5 School of Information and Communication Technology, Griffith University,
Gold Coast Campus, QLD 4222, Queensland, Australia. 6 Department of
Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong.
7 School of Electronic and Information Engineering, University of Sydney,
NSW 2006, Australia.

Authors' contributions 

JZ developed this methodology, carried out experiments and drafted the
manuscript. ZSF and AWL provided useful comments on methodology and
helped revise this manuscript. HY initiated the project, participated in
project design and helped revise the manuscript. All authors read and
approved the final manuscript.

Received: 8 July 2009 Accepted: 31 March 2010 
Published: 31 March 2010 

References 

1. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display
of genome-wide expression patterns. Proc Natl Acad Sci 1998,
95(25):14863-8.

2. Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic
determination of genetic network architecture. Nat Genet 1999,
22(3):281-5.

3. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E,
Lander ES, Golub TR: Interpreting patterns of gene expression with self-
organizing maps: methods and application to hematopoietic
differentiation. Proc Natl Acad Sci 1999, 96(6):2907-12.

4. Dembele D, Kastner P: Fuzzy C-means method for clustering microarray
data. Bioinformatics 2003, 19:973-980.

5. Schliep A, Schönhuth A, Steinhoff C: Using hidden Markov models to
analyze gene expression time course data. Bioinformatics 2003,
19:1255-1263.

6. Kerr MK, Churchill GA: Bootstrapping cluster analysis: assessing the
reliability of conclusions from microarray experiments. Proc Natl Acad Sci
2001, 98(16):8961-5.

7. Bar-Joseph Z: Analyzing time series gene expression data. Bioinformatics
2004, 20:2493-2503.

8. Hanisch D, Zien A, Zimmer R, Lengauer T: Co-clustering of biological
networks and gene expression data. Bioinformatics 2002,
18(Suppl 1):S145-54.

9. Pan W: Incorporating gene functions as priors in model-based clustering
of microarray gene expression data. Bioinformatics 2006, 22(7):795-801.






10. Huang D, Pan W: Incorporating biological knowledge into distance-based
clustering analysis of microarray gene expression data. Bioinformatics
2006, 22(10):1259-1268.

11. Aubry M, Monnier A, Chicault C, de Tayrac M, Galibert MD, Burgun A,
Mosser J: Combining evidence, biomedical literature and statistical
dependence: new insights for functional annotation of gene sets. BMC
Bioinformatics 2006, 7:241.

12. Shiga M, Takigawa I, Mamitsuka H: Annotating gene function by
combining expression data with a modular gene network. Bioinformatics
2007, 23(13):i468-i478.

13. Tari L, Baral C, Kim S: Fuzzy c-means clustering with prior biological
knowledge. J Biomed Inform 2009, 42:74-81.

14. Tritchler D, Parkhomenko E, Beyene J: Filtering genes for cluster and
network analysis. BMC Bioinformatics 2009, 10:193.

15. Zhu S, Zeng J, Mamitsuka H: Enhancing MEDLINE document clustering by
incorporating MeSH semantic similarity. Bioinformatics 2009,
25(15):1944-1951.

16. Zhu S, Takigawa I, Zeng J, Mamitsuka H: Field independent probabilistic
model for clustering multi-field documents. Information Processing &
Management 2009, 45:555-570.

17. Stark H, Yang Y: Vector space projections: a numerical approach to signal
and image processing, neural nets, and optics. New York: Wiley 1998.

18. Gan X, Liew AWC, Yan H: Microarray missing data imputation based on a
set theoretic framework and biological knowledge. Nucleic Acids Res
2006, 34:1608-1619.

19. Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF: A new method to measure
the semantic similarity of GO terms. Bioinformatics 2007, 23:1274-1281.

20. Zeng J, Liu ZQ: Markov Random Field-based Statistical Character
Structure Modeling for Handwritten Chinese Character Recognition. IEEE
Trans Pattern Anal Mach Intell 2008, 30(5):767-780.

21. Zeng J, Liu ZQ: Type-2 fuzzy Markov random fields and their application
to handwritten Chinese character recognition. IEEE Trans Fuzzy Syst 2008,
16(3):747-760.

22. Feng W, Liu ZQ: Region-Level Image Authentication Using Bayesian
Structural Content Abstraction. IEEE Trans Image Process 2008,
17(12):2413-2424.

23. Zeng J, Feng W, Xie L, Liu ZQ: Cascade Markov random fields for stroke
extraction of Chinese characters. Inf Sci 2010, 180:301-311.

24. Blei DM, Ng AY, Jordan MI: Latent Dirichlet allocation. J Mach Learn Res
2003, 3(4-5):993-1022.

25. Zeng J, Liu ZQ: Type-2 Fuzzy Hidden Markov Models and Their
Application to Speech Recognition. IEEE Trans Fuzzy Syst 2006,
14(3):454-467.

26. Ramoni MF, Sebastiani P, Kohane IS: Cluster analysis of gene
expression dynamics. Proc Natl Acad Sci 2002, 99:9121-9126.

27. Lord PW, Stevens RD, Brass A, Goble CA: Investigating semantic similarity 
measures across the Gene Ontology: the relationship between sequence 
and annotation. Bioinformatics 2003, 19:1275-1283. 

28. Adryan B, Schuh R: Gene-Ontology-based clustering of gene expression 
data. Bioinformatics 2004, 20:2851-2852. 

29. Bolshakova N, Azuaje F, Cunningham P: A knowledge-driven approach to 
cluster validity assessment. Bioinformatics 2005, 21:2546-2547. 

30. Guo X, Liu R, Shriver CD, Hu H, Liebman MN: Assessing semantic similarity 
measures for the characterization of human regulatory pathways. 
Bioinformatics 2006, 22:967-973. 

31. Wolting C, McGlade CJ, Tritchler D: Cluster analysis of protein array results 
via similarity of Gene Ontology annotation. BMC Bioinformatics 2006, 
7:338. 

32. Steuer R, Humburg P, Selbig J: Validation and functional annotation of 
expression-based clusters based on gene ontology. BMC Bioinformatics 
2006, 7:380. 

33. Schlicker A, Domingues FS, Rahnenführer J, Lengauer T: A new measure
for functional similarity of gene products based on Gene Ontology. BMC
Bioinformatics 2006, 7:302.

34. Sevilla JL, Segura V, Podhorski A, Guruceaga E, Mato JM, Martinez-Cruz LA, 
Corrales FJ, Rubio A: Correlation between gene expression and GO 
semantic similarity. IEEE/ACM Trans Comput Biol Bioinform 2005, 2:330-338. 

35. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO,
Botstein D, Futcher B: Comprehensive identification of cell cycle-
regulated genes of the yeast Saccharomyces cerevisiae by microarray
hybridization. Mol Biol Cell 1998, 9:3273-3297.

36. Hughes TR, et al: Functional discovery via a compendium of expression
profiles. Cell 2000, 102:109-26.

37. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, 
Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: 
new developments in KEGG. Nucleic Acids Res 2006, 34:D354-D357. 

38. Kreyszig E: Introductory Mathematical Statistics. New York: John Wiley & 
Sons 1970. 

39. Cao XQ, Zeng J, Yan H: Structural property of regulatory elements in 
human promoters. Phys Rev E 2008, 77:041908. 

40. Cao XQ, Zeng J, Yan H: Structural properties of replication origins in 
yeast DNA sequences. Phys Biol 2008, 5:036012. 

41. Cao XQ, Zeng J, Yan H: Physical signals for protein-DNA recognition.
Phys Biol 2009, 6:036012.

42. Zeng J, Cao XQ, Zhao H, Yan H: Finding human promoter groups based 
on DNA physical properties. Phys Rev E 2009, 80:041917. 

43. Zeng J, Zhao XY, Cao XQ, Yan H: SCS: Signal, context and structure 
features for genome-wide human promoter recognition. IEEE/ACM Trans 
Comput Biol Bioinform 2010. 

44. Zeng J, Zhu S, Yan H: Towards accurate human promoter recognition: a
review of currently used sequence features and classification methods.
Briefings in Bioinformatics 2009, 10(5):498-508.



doi:10.1186/1471-2105-11-164

Cite this article as: Zeng et al.: Multiconstrained gene clustering based
on generalized projections. BMC Bioinformatics 2010, 11:164.






